Supervised Classification

Since results of unsupervised clustering are not quite satisfying and human-created training data is available, supervised classification may be a way to achieve more accurate results.

## [1] 24000

Using packages “sp” and “rgdal”, shape files can be read which contain geometric shapes (in ths case points) associated with classes. The first line of the following reads the shapes. The labels of each point are then converted to a factor, which contains levels for each label, thereby recognizing the different categories.

allTrainShapes <- readOGR(dsn="C:\\Users\\D059331\\Desktop\\DM GIC\\data\\shp\\trn", layer="J_treino_QB_Tot_point")
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\D059331\Desktop\DM GIC\data\shp\trn", layer: "J_treino_QB_Tot_point"
## with 483173 features
## It has 3 fields
trainFactor <- as.factor(allTrainShapes$Label)
length(levels(trainFactor))
## [1] 10

As shown in the output, there are 10 different labels in the training set. A color vector of 10 different colors is created in order to plot the different labels on the image.

colors <- c("papayawhip", "dodgerblue3", "black", "tan3", "tan4", "forestgreen", "darkgreen", "palegreen2", "springgreen3", "darkolivegreen")
# Plot the raster
plotRGB(rasterJ,3,2,1)
plot(allTrainShapes, add=TRUE, col=colors[trainFactor])

Before fitting a model, the raster is divided up into multiple sectors. Some of those sectors can then be used for training and others for validation. The selected sectors are all the same size and were created by moving from the top left corner to the bottom, the the right and diagonally. Approximately 40% of the sectors shall be used for validation. The following code creates the subrasters. Then, the training data is extracted at the location of reach sector and is plotted on top of each sector.

######################## Problem: some have NA's!
trainingExtents <- c(
    extent(-105000, -103000, -42000, -40000),
    extent(-105000, -103000, -44000, -42000),
    extent(-105000, -103000, -46000, -44000),
    extent(-103000, -101000, -42000, -40000),
    extent(-103000, -101000, -44000, -42000),
    extent(-103000, -101000, -50000, -48000),
    extent(-101000, -99000, -42000, -40000),
    extent(-101000, -99000, -48000, -46000),
    extent(-101000, -99000, -50000, -48000),
    extent(-99000, -97000, -46000, -44000),
    extent(-99000, -97000, -48000, -46000),
    extent(-99000, -97000, -50000, -48000),
    extent(-97000, -95000, -44000, -42000),
    extent(-97000, -95000, -46000, -44000),
    extent(-97000, -95000, -48000, -46000)
)
for(i in 1:length(trainingExtents)) {
    trainingShapesAtSector <- crop(allTrainShapes, trainingExtents[[i]])
    trainingSector <- crop(rasterJ, trainingExtents[[i]])
    plotRGB(trainingSector,3,2,1)
    plot(trainingShapesAtSector, add=TRUE, col=colors[trainFactor])
}

validationExtents <- c(
    extent(-105000, -103000, -48000, -46000),
    extent(-105000, -103000, -50000, -48000),
    extent(-103000, -101000, -46000, -44000),
    extent(-103000, -101000, -48000, -46000),
    extent(-101000, -99000, -44000, -42000),
    extent(-101000, -99000, -46000, -44000),
    extent(-99000, -97000, -42000, -40000),
    extent(-99000, -97000, -44000, -42000),
    extent(-97000, -95000, -42000, -40000),
    extent(-97000, -95000, -50000, -48000)
)
for(i in 1:length(validationExtents)) {
    validationShapesAtSector <- crop(allTrainShapes, validationExtents[[i]])
    validationSector <- crop(rasterJ, validationExtents[[i]])
    plotRGB(validationSector,3,2,1)
    plot(validationShapesAtSector, add=TRUE, col=colors[trainFactor], pch=".")
}

The next step is to fit a model by training with one sector in the training set. This study uses the random forest classifier.

trainPos <- 1
trainingSector <- crop(rasterJ, trainingExtents[[trainPos]])
trainingShapesAtSector <- crop(allTrainShapes, trainingSector)
trainingSectorValues <- extract(trainingSector, trainingShapesAtSector)
## Warning in .local(x, y, ...): Transforming SpatialPoints to the CRS of the
## Raster
trainingDataFrame <- data.frame(trainingSectorValues)
forest <- randomForest(
    x=trainingDataFrame,
    y=as.factor(trainingShapesAtSector$Label)
)

The above code trained the classifier with only one training sector. This classifier can be used to predict labels for new data. In the following, the random forest model is used to predict the sector it has been trained with and plots the prediction results.

prediction <- predict(trainingSector, forest, type="response",  na.rm=TRUE)
plotRGB(trainingSector, 3, 2, 1)

plot(prediction)

Just by observing the picture it is apparent that the model somewhat recognizes the landscape. Since there are correct values available for some points (that is, our training data) we can calculate the error rate on those points. The error can be calculated by counting the missclassifications and then dividing that number by the total number of training points available in the sector.

calculateError <- function(prediction, actual) {
    prediction <- extract(prediction, actual)
    trainDiff <- prediction - actual
    trainDiffCount <- 0
    for(i in 1:length(prediction)) {
        if(trainDiff[i] != 0) {
            trainDiffCount <- trainDiffCount + 1
        }
    }
    return(trainDiffCount  / length(trainDiff))
}
# Create a matrix to collect training and validation errors
errorTrend <- matrix(ncol=length(trainingExtents), nrow=2, dimnames=list(c("Error in Training Set","Error in Validation Set")))
errorTrend[1,trainPos] <- calculateError(prediction, trainingShapesAtSector$Label)
errorTrend[1,trainPos]
## Error in Training Set 
##            0.06054854

Next, the error in the validation set will be measured. The validation set will be predicted, then the error of all known points will be computed.

errorSum <- 0
mergedExtent <- do.call(merge, validationExtents)
validationSector <- crop(rasterJ, mergedExtent)
validationShapesAtSector <- crop(allTrainShapes, validationSector)
prediction <- predict(validationSector, forest, type="response", na.rm=TRUE)
errorSum <- errorSum + calculateError(prediction, validationShapesAtSector$Label)
errorTrend[2,trainPos] <- errorSum / length(validationExtents)
errorTrend[2,trainPos]
## Error in Validation Set 
##              0.03444375

The error value in the validation set is much higher, which was to be expected since the training set is naturally easier to predict for a model that has been trained with exactly that dataset. To prevent overfitting the model, both error rates must be observed. It is expected that both error rates will drop with more training, until the validation error will start to rise again. That point indicates, that the model should stop training in order to not overfit. The next code block trains the forest with the next piece of training data and then determines the prediction errors in training and validation sets.

#for(trainPos in 2:length(trainingSectors)) {
trainPos <- 2

mergedExtent <- do.call(merge, trainingExtents[1:4])
trainingSector <- crop(rasterJ, mergedExtent)
trainingShapesAtSector <- crop(allTrainShapes, trainingSector)
trainingSectorValues <- extract(trainingSector, trainingShapesAtSector)
## Warning in .local(x, y, ...): Transforming SpatialPoints to the CRS of the
## Raster
trainingDataFrame <- data.frame(trainingSectorValues)
forest <- randomForest(
    x=trainingDataFrame,
    y=as.factor(trainingShapesAtSector$Label)
)
prediction <- predict(trainingSector, forest, type="response", na.rm=TRUE)
errorTrend[1,trainPos] <- calculateError(prediction, trainingShapesAtSector$Label)
errorTrend[1,trainPos]
## Error in Training Set 
##             0.1575777